Big Entropy and the Generalized Linear Model

We'll move conceptually at a slow rate, which will set up a bunch of different models for this week and next.

Imagine you have five buckets equidistant from you. At your feet you have 100 pebbles, each painted with a number, so every pebble is unique.

What happens when we toss the pebbles one at a time into the buckets at random? Eventually all 100 pebbles end up in the buckets, you count them, and you get a distribution of pebbles. What types of distributions are really common, and what types are really rare?

Think about extreme distributions first. There's only 1 way to get all 100 pebbles in bucket 1.

Same with bucket 5. So there are 5 unique distributions with all pebbles in a single bucket.

There are a bunch of distributions that can happen in a bunch of different ways. We could take a pebble from bucket 2 and swap it with one from bucket 3, and the counts stay the same. How many ways could you get the same distribution? This very problem is the basis of Bayesian inference. Some distributions can arise in many more ways than others. Betting on those is a principle called Maximum Entropy, and it justifies Bayesian inference.

We can replace the integers with $n$s. At some point, you learned that there's a formula for the number of arrangements of the pebbles.
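
For reference, the formula is presumably the standard multinomial count. Writing $N$ for the total number of pebbles and $n_i$ for the number in bucket $i$ (the subscripted symbols are labels assumed here), the number of ways $W$ to realise a given set of counts is:

$$
W = \frac{N!}{n_1!\, n_2!\, n_3!\, n_4!\, n_5!}
$$

With all 100 pebbles in a single bucket this gives $W = 100!/100! = 1$, matching the single-bucket case above.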

This is called the multiplicity. It's the foundation of statistical inference. It gets big really fast as the $n$s get more equal.

Only one way to get all the pebbles in bucket 3.

How many ways to get the second distribution?

It's massively bigger. This will accelerate. People have really bad intuitions regarding combinatorics.

Now we've got two in bucket 2. Now we're getting an order of magnitude increase.
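
To get a feel for how fast the counts grow, here is a minimal sketch in R. The bucket counts are hypothetical allocations of 100 pebbles across five buckets, not the ones from the slides, and `log_multiplicity` is a helper name made up for this illustration. It works on the log scale because the raw counts get enormous.

```r
# Log multiplicity: log(N! / (n_1! * ... * n_k!)).
# lfactorial() keeps everything on the log scale so nothing overflows.
log_multiplicity <- function(n) {
  lfactorial(sum(n)) - sum(lfactorial(n))
}

# Hypothetical allocations of 100 pebbles across 5 buckets.
buckets <- list(
  all_in_one = c(0, 0, 100, 0, 0),
  one_moved  = c(0, 1, 99, 0, 0),
  two_moved  = c(0, 2, 98, 0, 0),
  peaked     = c(10, 20, 40, 20, 10),
  flat       = c(20, 20, 20, 20, 20)
)

log_w <- sapply(buckets, log_multiplicity)
round(log_w, 1)       # log number of ways for each allocation
exp(log_w[1:3])       # raw counts for the three most extreme allocations
```

Under these assumed counts, moving one pebble out of the full bucket gives 100 ways, moving two gives 4,950, and the flat allocation can happen in a number of ways on the order of $10^{66}$.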

General principle: distributions that are flat can happen in many, many more ways, and this is why we bet on them. They have high entropy. Flat distributions are closer to the alternatives, so they are less surprised when the distribution turns out to be different. They become really good foundations for statistical inference, because they distribute the possibilities as widely as possible.

This is one way to derive the formula for information entropy. It's nothing more than the multiplicity. $W$ is the multiplicity (the number of ways to get the $n$s). Then we've taken its logarithm and normalised it by the number of pebbles, and for large $N$ that is well approximated by $-\sum_i p_i \log p_i$. So information entropy is just the logarithm of the number of ways to realise a distribution, per pebble. It's maximised when the distribution is flat, and flatter distributions have higher entropy.
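
The approximation can be checked numerically. This is a sketch under assumed bucket counts (the same hypothetical flat and peaked allocations of five buckets), with helper names invented for the illustration. It compares the per-pebble log multiplicity with the entropy of the bucket proportions.

```r
# Log multiplicity, log(N! / prod(n_i!)), on the log scale.
log_multiplicity <- function(n) lfactorial(sum(n)) - sum(lfactorial(n))

# Information entropy of the bucket proportions, treating 0 * log(0) as 0.
entropy <- function(n) {
  p <- n / sum(n)
  -sum(ifelse(p > 0, p * log(p), 0))
}

# Gap between entropy and the per-pebble log multiplicity.
gap <- function(n) entropy(n) - log_multiplicity(n) / sum(n)

n_flat   <- c(20, 20, 20, 20, 20)   # hypothetical counts
n_peaked <- c(10, 20, 40, 20, 10)   # hypothetical counts

c(flat = entropy(n_flat), peaked = entropy(n_peaked))  # flat is higher
gap(n_flat)          # approximation error with 100 pebbles
gap(n_flat * 100)    # same proportions, 10,000 pebbles: much smaller error
```

The gap shrinks as the number of pebbles grows, which is the sense in which the normalised log multiplicity turns into the entropy formula.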

Most centrally associated with Jaynes. If you choose any other distribution to characterise your state of knowledge, you will be implicitly adding other information into your distribution. So if you lay out all the constraints, then solve for the distribution that's as flat as possible under those constraints, you do the best you possibly can. You're honestly characterising your ignorance.

Lots of conceptual examples for this. What is the information content of a prior distribution? It turns out that Bayesian updating is a special case of this principle. You can input the data as constraints, and you get the posterior distribution by solving the maximum entropy problem. High entropy is good because the distance from the truth is smaller. One way to think about it is that it's deflationary. No matter what happens, an even distribution is bound to arise. We put a tiny sliver of scientific information into our model, and for the rest we just bet on entropy.

This motivates the move to other distributions. If we're going to maximise this function, it's highest when all the $p$s are equal. Sometimes there are constraints that prevent us from making the $p$s equal. What kind of constraints? A known mean or variance.

This is actually what we did in Week 1. Shows that it's just counting.

Under some set of constraints, the distributions we use are maximum entropy distributions. Exponential distributions used for scale. They have a very clear maxent constraint. If a parameter is non-negative real, and has some mean value, then the exponential contains only that information.

Larger family of geocentric linear models. We want to connect a linear model to the mean of the outcome distribution. Unreasonably effective given how geocentric it is. We pick an outcome distribution, then model its parameters using weird things called links, which link the distribution to some model. Can do all kinds of fancy things with the same basic strategy. Often, even if you don't want to play this game, when you write your model down it'll turn out to be a linear model anyway. In most cases, you probably want a GLM.

Distributions arise from natural processes. So resist histomancy, which is picking the outcome distribution by staring at a histogram of the outcome. This doesn't make sense under any framework. You want to use knowledge of your constraints to figure it out. There's no statistical framework in which the aggregate outcome is going to have any particular distribution.

Going to build GLMs with these different outcome distributions. Just an extension of what you've already been doing. Exponential is everyone's favourite because it only has 1 parameter. Lambda is a rate, and the mean is 1/lambda. Generatively it can arise from a machine with a number of parts, where if one part breaks, the whole thing stops working. A fruit fly is the same. If there are a bunch of parts inside the washing machine, and each part has a chance of breaking at any particular time, the waiting time until the washing machine stops is exponentially distributed.
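
A small simulation can illustrate the washing machine story. This is a sketch, not a derivation: it assumes each part fails at a constant rate (so each part's own lifetime is exponential), and the number of parts and the rate are made-up values. The machine stops at the first failure, and the resulting waiting time should behave like an exponential whose rate is the sum of the part rates.

```r
set.seed(11)
n_parts <- 5      # hypothetical number of parts in the machine
rate    <- 0.1    # hypothetical failure rate of each part, per unit time
n_sims  <- 1e5    # number of simulated machines

# Each row is one machine; each column is one part's time to failure.
part_times <- matrix(rexp(n_sims * n_parts, rate = rate), ncol = n_parts)

# The machine stops as soon as its first part breaks.
machine_times <- apply(part_times, 1, min)

mean(machine_times)      # simulated mean waiting time
1 / (n_parts * rate)     # mean of an exponential with rate n_parts * rate
```

The simulated mean lands close to `1 / (n_parts * rate)`, which is the "mean is 1/lambda" rule with lambda equal to the combined failure rate.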

If you count events arising from exponential processes, you get the binomial. Mortality of fruit flies is binomial. Like coin flips: each fly could or could not ascend. And the binomial is maxent.

Poisson. There are two ways of thinking about it. One is as a special case of the binomial: if the probability of success is low and there are lots of flies observed over a long time, the counts are approximately Poisson.
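
The binomial-to-Poisson connection is easy to check with base R's density functions. The number of trials and the success probability below are hypothetical values chosen so that successes are rare.

```r
n <- 1000     # hypothetical number of trials (flies)
p <- 0.002    # hypothetical, small probability of the event per trial
k <- 0:10     # counts to compare

binom <- dbinom(k, size = n, prob = p)   # exact binomial probabilities
pois  <- dpois(k, lambda = n * p)        # Poisson with the same expected count

round(cbind(k, binom, pois), 4)
max(abs(binom - pois))                   # largest pointwise difference
```

With these values the two sets of probabilities agree to about three decimal places, even though the Poisson only needs the expected count `n * p`.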

If you think about the time to the event in the exponential (how long did you wait until the washing machine broke?) and you start adding up those times, the summed waiting times are distributed like Gamma. Also maxent. An example is age of onset of cancer, perhaps because there are a lot of cellular defence mechanisms, and all of them need to fail.
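
Adding up waiting times can also be simulated. In this sketch the number of events `k` and the rate are hypothetical; it sums `k` independent exponential waiting times and compares the result with a Gamma distribution with shape `k` and the same rate.

```r
set.seed(11)
k      <- 4      # hypothetical number of events that must all occur
rate   <- 0.5    # hypothetical rate of each exponential waiting time
n_sims <- 1e5

# Each row holds k successive waiting times; the row sum is the total wait.
waits <- matrix(rexp(n_sims * k, rate = rate), ncol = k)
total <- rowSums(waits)

# Compare simulated quantiles with Gamma(shape = k, rate = rate) quantiles.
probs <- c(0.1, 0.5, 0.9)
rbind(simulated = quantile(total, probs),
      gamma     = qgamma(probs, shape = k, rate = rate))
```

The two rows should be close, and the simulated mean should sit near `k / rate`.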

If you get a Gamma with a really large mean, it converges to a Normal. But not the only way - all roads lead to normal. And it's hard to leave. So these are generative processes, based on the constraints. Doesn't mean that they're correct, but it's the betting part.

Tide prediction engine. When we get to GLMs, the metaphor is very potent. It's a mechanical computer: part of it is the prediction of times, and then there's messy stuff at the bottom that's calculating the output. With GLMs you're absolutely wedded to the prediction perspective. It's hard to have intuition about the parameters. You want to understand the prediction space, and you understand the parameters by observing their effects on prediction.

You just need to think about what you know about the outcome variable before the data have arrived. For example, count variables are integers starting at 0, so there are no negative counts. So from the beginning you know things about them, and that constrains the distribution before the data arrive. Next week we'll move on to monsters, which we build by gluing together different models using links. Likert scales are ordinal scales, but they're not numeric: what it takes to get from 1 to 2 might be different from what it takes to go from 2 to 3. Fight monsters by making monsters. Mixture models are super useful. They bear a lot of resemblance to multi-level models.

Consider the Gaussian linear regression. It's super benign, and that's because it has a special property: the scientific measurement units and the parameter for the mean are the same.

The much more typical case is the binomial model. If you want to connect a linear model to $p$, remember that $p$ is a probability. Probability is unitless: the units are divided out. But the outcome is a count. So now the units aren't the same, and we need something that connects the parameter to the outcome scale. We need some function to put in where the question mark is so that it obeys physics.

We're going to wrap $p$ in some function which constrains it. Say there's some function we can apply to the probability so that the transformed value is linear in the predictors.

Searching is harder. OLS can be used, but can be fragile. We're just going to use MCMC because we don't want to worry about it.

One of the fun things is that suddenly all the variables automatically interact with each other. Imagine you're trying to understand the habitat preferences of a reptile. If it gets really cold, the probability of survival is low, but when it's hot they're fine. On the probability scale, eventually things get cold enough that you're dead no matter what. If any one variable will kill the lizard, it doesn't matter what the other variables are doing. That's an interaction. No matter how much food you give it, it's going to die if it's really cold. You want your model to do this.

If you like to think about the rate of change in a linear regression, you take a partial derivative and get the slope. Do this with any GLM, and the chain rule kicks in, and you get a much less nice expression. In a logistic regression, that's the equation. If you take the partial derivative, you get this thing on the right. That's the rate of change, and it depends on where you are on the probability scale.
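
As a sketch of what the chain rule produces: for a logistic regression with a single predictor, $p = \text{logistic}(\alpha + \beta x)$, the derivative of $p$ with respect to $x$ is $\beta\, p\, (1 - p)$. This is the standard result, written here in a form that may differ from the one on the slide. The coefficient values below are hypothetical, and the code checks the analytic slope against a finite-difference approximation.

```r
alpha <- -1     # hypothetical intercept on the log-odds scale
beta  <- 0.8    # hypothetical slope on the log-odds scale

# Probability as a function of the predictor (plogis is the inverse logit).
p_of_x <- function(x) plogis(alpha + beta * x)

# Analytic rate of change from the chain rule: beta * p * (1 - p).
slope_analytic <- function(x) {
  p <- p_of_x(x)
  beta * p * (1 - p)
}

# Numerical rate of change by central finite differences.
slope_numeric <- function(x, h = 1e-6) {
  (p_of_x(x + h) - p_of_x(x - h)) / (2 * h)
}

x <- c(-5, 0, 1.25, 5)
cbind(x, p = p_of_x(x),
      analytic = slope_analytic(x),
      numeric  = slope_numeric(x))
```

The slope is largest where $p = 0.5$ (here at `x = 1.25`, where it equals `beta / 4`) and shrinks toward zero as $p$ approaches 0 or 1. That is the floor and ceiling effect behind the automatic interactions described earlier.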

Let's move into doing some good work. We'll model some counts of events. What's the Binomial distribution for? Counts of successes out of trials. There's some constant expected value conditional on a set of predictor variables. Under those conditions the maxent distribution is binomial.

The expected value is $np$, and the variance is $np(1-p)$. Note the variance is related to the expected value. Among the distributions we use here, the Gaussian is the only one where the mean and the variance are independent. With the others, the variance changes when the mean changes.

So we're going to take a linear model and attach it to $p$.

On the horizontal I have some predictor $x$. What are the log odds? The log of $p / (1 - p)$.

If you do this, there's a really nice mapping onto the probability scale, where the model is linear in $x$ on the log-odds scale and constrained to the (0, 1) interval on the probability scale. This arises from the maxent derivation of the binomial distribution. In machine learning they call it the maxent classifier.

Logit means 'log odds'. $p$ is on the probability scale.

It really is just log odds. If you measure stuff in odds, you can measure things really well. Log odds are just the log of the odds, and on that scale the model is linear. How do you get back to the probability scale? Solve for $p$.
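
Solving for $p$ takes two steps. The linear model $\alpha + \beta x$ is used here as a generic stand-in for whatever sits on the right-hand side:

$$
\log\frac{p}{1-p} = \alpha + \beta x
\;\;\Longrightarrow\;\;
\frac{p}{1-p} = e^{\alpha + \beta x}
\;\;\Longrightarrow\;\;
p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}
$$

The last expression is the logistic (inverse logit) function, which always returns a value between 0 and 1.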

This is the conventional way to link, because it has lots of good mathematical properties.

For intuition, you want to relate the two scales. Horizontal is probability. Vertical is log-odds. Log odds of 0 is an equal chance. There's this compression effect toward the extremes, so you need some sense of scale. Log odds of -1 is roughly 1/4. This is really important for defining priors.
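
A quick way to build that sense of scale is to convert a few log-odds values with base R's `plogis()` (the inverse logit) and `qlogis()` (the logit). The particular values chosen are just illustrative.

```r
log_odds <- c(-4, -1, 0, 1, 4)

# plogis() maps log-odds to probability.
round(plogis(log_odds), 3)

# qlogis() is the logit, so it maps the probabilities back to log-odds.
qlogis(plogis(log_odds))
```

Log-odds of 0 maps to exactly 0.5, log-odds of -1 to about 0.27, and by ±4 the probabilities are already within about 0.02 of 0 and 1.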

We use this thing because it's the natural link within the probability formula. It arises naturally in the derivation of the distribution. There are other big and legitimate links. If you have a scientific model, you can derive the link automatically.

Example dataset. Imagine you're a chimp on the close side. If you pull a lever, it expands out on both sides. There may or may not be food in both trays. If you pull the right one, the other chimp will get the snack too. We're interested in whether chimps care about this distinction. It's not enough to just do the experiment. They might pull the right because there's more food there. One of the treatments is to remove the partner from the other end. Also chimpanzees are handed, so you have to adjust for that. But you want to know the difference: do they pull the prosocial option more if there's a chimp on the other end?

Alone with no other chimp. The prosocial and asocial options are balanced across left and right. We want to predict the outcome as a function of the condition, the total treatment.

Four possible distinct unordered treatments. Want to estimate the tendency to pull the lever. The linear model on the left is the Binomial. $\alpha$ measures handedness. Then we have a vector of four $\beta$ parameters, one for each treatment. Note that the Bernoulli is just the Binomial with one trial.
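
One way to write down the model described here, as a sketch. The symbols $L_i$ (whether the lever was pulled on trial $i$), $A[i]$ (the chimp acting on trial $i$) and $T[i]$ (the treatment on trial $i$) are labels assumed for this illustration, and letting $\alpha$ vary by chimp is an assumption based on handedness being an individual trait. Priors are left out.

$$
\begin{aligned}
L_i &\sim \text{Binomial}(1, p_i) \\
\text{logit}(p_i) &= \alpha_{A[i]} + \beta_{T[i]}
\end{aligned}
$$

The index $T[i]$ runs over the four treatments, so $\beta$ is a vector of four parameters.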

How to do priors? They behave in GLMs in very unpredictable ways, so we need to do prior simulation. Let's consider a skeletal version of Binomial regression where the linear model is just some $\alpha$, an intercept: the average log odds across all trials. What kind of prior to set on that? Let's set a Gaussian, centered on log-odds 0, which is a probability of a half. But what about the scale? What happens when you pick $\omega$?

Let's try with $\omega = 10$.

What happens is we have the prior on the probability scale. The black density curve is the prior where you assign $\alpha$ the Normal(0, 10). A Gaussian with that scale has a huge amount of mass beyond absolute 3 on the log-odds scale, while the useful range of the log-odds scale is about -4 to 4. So most of the mass is out in the extremes, and when you change to the probability scale, it piles up near 0 and 1. We can adopt the heuristic position of having something flatter, which is a Normal with $\omega$ of 1.5.
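
The prior simulation can be sketched in a few lines of base R. This is not the code behind the slide's figure, just a minimal version of the same idea: draw $\alpha$ from the prior, push it through the inverse logit, and see how much prior mass ends up at extreme probabilities. The cut-offs of 0.01 and 0.99 and the helper name `extreme_share` are choices made for this illustration.

```r
set.seed(11)
n_sims <- 1e4

# Share of prior mass at extreme probabilities for a Normal(0, omega) prior
# on the log-odds intercept alpha.
extreme_share <- function(omega) {
  alpha <- rnorm(n_sims, mean = 0, sd = omega)   # prior draws, log-odds scale
  p <- plogis(alpha)                             # same draws, probability scale
  mean(p < 0.01 | p > 0.99)
}

extreme_share(10)    # wide prior: most of the mass sits near 0 or 1
extreme_share(1.5)   # omega = 1.5: almost none of it does
```

Plotting `density(plogis(rnorm(n_sims, 0, omega)))` for each value of `omega` gives a picture of the two priors on the probability scale.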

Next we'll talk about slopes.
